Prosper Loan data exploration

by Andy Watson

Preliminary Wrangling

Little wrangling is needed. Udacity provided us with a clean csv file from Prosper Financials. But we will change the ListingCreationDate column from object to datetime format. We are also going to manipulate the dataframe quite a bit prepare it for the data analysis steps.

Prepare Data for Analysis

Below I make a few adjustments to the loan data frame.

  1. Create a data frame for loans created up through 2008. I will limit my investigations through 2008 because all those loans terms should've been completed by 2014 (the longest term is 5 years and 2013-5=2008). I will also change the ListingCreationDate variable to datetime.

  2. Create "Successful" (rename "Completed") and "Unsuccessful" (including both "Chargedoff" and "Defaulted") columns.

  3. Create an ROI (return on investment) column.

  4. Turn the ordinal variables (that I will be investigating) into categorical variables.

  5. Create lists of each type of variable for the variables I'll be investigating.

ROI = (LP_CustomerPyaments - LoanOrignalAmount - LP_ServiceFees - LP_CollectionFees + LP_NonPrincipalRecoverypayments)/LoanOriginalAmount

What is the structure of your dataset?

The file contains data on 113,937 loans (rows) with 81 variables (columns).

Variables include loan amount, interest rate, information about borrower's financial history, and current loan status.

The data was collected from 2005 through 2014.

What is/are the main feature(s) of interest in your dataset?

My goal here is to find variables that can accurately predict a loan's outcome.

In order to make the project manageable, I will restrict my focus to information about the borrower.

What features in the dataset do you think will help support your investigation into your feature(s) of interest?

Quantitative Variables:

I expect the borrower's credit rating score to accurately predict loan outcomes. The credit rating is created by financial institutions to assess a borrower's credit risk based on current and historical financial data about the borrower.

Borrower income range and debt to income ratio should also correlate strongly with loan outcomes.

Categorical Variables:

I will also explore two categorical variables: borrower's state, and borrower occupation.

Loan outcome data

The first primary variable I tried was LoanStatus: the current status of the loan:

Cancelled, Chargedoff, Completed, Current, Defaulted, FinalPaymentInProgress, PastDue. The PastDue status will be accompanied by a delinquency bucket.

In order to avoid comparing new loans with older ones (the new ones will of course have a much lower default rate and a much higher in-progress rate), I will restrict my analysis to loans that originated from 2005 through 2008. The loans originating in this time period were all either 12 or 36 month loans. Therefore, according to the original loan terms, they should've been completed long before 2014, when they stopped updating the data.

After removing the loans outside of my selected date range, I was left with 29056 loans.

I created variable called NewLoanStatus that divided the LoansStatus variable into loans that had been completed successfully, and those that had not been completed successfully.

Specifically, I combined the Chargedoff and Defaulted categories into one category 'Unsuccessful' and the Completed category I renamed 'Successful'. (The only other category pertaining to loans in my time range was Cancelled. I wasn't sure if a cancelled loan should count as a success or a failure, and there were only 5 of them anyway, so I ignored that category.)

Soon I doubted the sufficiency of this initial 'Unsuccessful' / 'Successful' metric. It didn't account for how much of a loan was paid off before it was cancelled or defaulted.

So I created another primary variable:

I created a new column called 'ROI', where I calculated the return on investment using this formula:

(loans_to_2008['LP_CustomerPayments'] - loans_to_2008['LoanOriginalAmount'] - loans_to_2008['LP_ServiceFees'] - loans_to_2008['LP_CollectionFees'] + loans_to_2008['LP_NonPrincipalRecoverypayments'])/loans_to_2008['LoanOriginalAmount']

Preliminary Investigation of Data

Univariate Exploration

First: What is the distribution of loan outcomes?

The Completed category dwarfs all other categories.

I think I should limit my investigations to 2008.

Because all those loans terms should've been completed by 2014 (the longest term is 5 years and 2013-5=2008).

I looked at the loans created from 2005 to 2008.

All of those loans's original terms were up by 2014, when they stopped updating our data. Most were completed, but a significate number where charged off or defaulted. Only 5 loans were cancelled.

I googled Default and Charged Off and found that Default status is generally declared when a loan is 121 past due. The loan is generally charged off (ie: all hope is abandoned for the recovery of the debt) after 150 days past due.

I should check to see what percentage of the defaulted loans are 5-year loans from 2008. But first I want to look at the above data in terms of percentages.

From 2005 through 2008, Prosper made 29056 loans. By 2014, 18385 loans were completed, 6654 were charged off, 4012 were defaulted, and 5 were cancelled.

Oh!

All these loans were supposed to be paid in either one or three years. So at the very latest, according to the original loan terms, all loans should've been paid off by 2011 (since the loans originated from 2005 through 208).

I don't understand how any could still be in the Defaulted, rather than the ChargedOff category in 2014.

I guess these are cases when people had been paying a little for a long time but then stopped paying altogether a little before they stopped collecting data? How could there be so many loans that were 121 days past due but not yet 150 days past due? Are we to believe that in 2014 there were 4012 loans that were over 121 days past due or less than 150 days past due? It doesn't make sense.

In any case, I think it is safe to say that a loan that is either still defaulted or completely charged off several (or more) years after it should've been completely paid off was an unsuccessful loan.

So let's combine the Chargedoff and Defaulted categories into one category "UnSuccesful Loans", and rename "Completed" as "Successful Loans".

Creating the NewLoanStatus Variable

I am going to combine the chargedoff and defaulted LoanStatus categories into a category called Unsuccessful.

The completed loans I will call Successful.

Calculating the percent of Successful loans

I want to look at the percentages of loans that ended up being either successful or unsuccessful.

If we declare all loans that originated between 2005 and 2008 and that were completed by 2014 "Successful", and the rest of those loans (either Defaulted or Chargedoff, and ignoring the 5 Cancelled loans) "Unsuccessful, about 63% of the loans were successful.

But as I did this, I began to wonder how things would look if we factored in how much money was recovered in the various "Unsuccessful" loans. So I created a return on investment (ROI) column that factored in payments and recoveries.

Ther ROI variable was calculated like this:

loans_to_2008['ROI'] = (loans_to_2008['LP_CustomerPayments'] - loans_to_2008['LoanOriginalAmount'] - loans_to_2008['LP_ServiceFees'] - loans_to_2008['LP_CollectionFees'] + loans_to_2008['LP_NonPrincipalRecoverypayments'])/loans_to_2008['LoanOriginalAmount']

Graphing the ROI (return on investment) variable

Note: See the XYZ Section for a discussion of how I came up with this metric and See the top "Prepare Data for Analysis" section to see how I calculated the ROI variable.

The bulk of the loans seem to have either about a 0%-10%, 10%-20%,or 20%-30% return (each of these categories contains over 4000 loans). Fewer, but a still substantial amount of loans have a 30%-40% return (around 3000 loans). About 1300 loans had a 40%-50% return, and then the numbers drop off smoothly for 60% and 70% returns. ROIs higher than that are very rare in the data set.

On the negative side, the -100%, -90%, and -80% ROI columns each contain about 1000 loans, and the counts drop off gradually from there to 0, where there appear to be about 500 loans where the lender broke even.

Below the counts within each of these ranges are calculated programatically.

The most common return outcomes are 0-10% (4296 loans), 10-20% (5317 loans), 20-30% (5051 loans), 30-40% (2829) loans, and 40-50% (1324 loans). But there were six negative categories that had close to or more than 1000 loans. And the rest of the negative outcomes were over 500 -- except for the -100 through -110% category, which only 244 loans.

OK, never mind. That 8.888e-16 must be how pd.cut() gets around zero. I don't understand why it is afraid of zero, but I'll leave it alone.

Distributions of a predictor variables

Now I'll look at the distributions of variables whose effect on loan outcomes I want to study.

Income Ranges

I want to put these columns in order from highest to lowest.

First I need to see all the column names clearly.

Of the 29058 loans made by Prosper from 2005 through 2008:

Around 8000 loans went to people at the moderate-to-low income range of \$25K-\\$49,999. "Not Displayed" had a similar amount. The next highest category is about 5500 loans in the moderate-to-high \$50K-\\$74,999 range. About 500 loans were made to people with no income and even fewer to unemployed people. Each of the three remaining categories (one under \$25K and two over \\$75K) accounted for 3000 loans.

It would be interesting to see how the percentage of loans given to people at various income levels relates to the percentage of people in the country at those various income ranges. But that is outside the scope of this project.

We can say that there is a fair amount of data for all the income ranges (at least 2,000 loans for all categories except for the \$0 income; and much more for people with in the medium income ranges). Enough that the information should be statistically significant.

I am surprised how many incomes were not recorded. I'm not sure what to do with that category. I will perhaps exclude those loans when looking for trends in the bivariate and multivariate graphs.

Now let's look at credit grade distributions.

There seem to be few less credit ratings than total loans (29056 - 28940 = 16).

I am surprised that the AA, A, and B credit ratings total about 13,000 loans.

If you add up the income levels from \$50,000 to \\$100,000+, you get less than 10,000 total loans. Perhaps that is why there are relatively few loans with higher level credit ratings.

Of course there's the large "Not Displayed" section in the IncomeRange column. I may have to remove those people from the final analysis. They might make it difficult to speak meaningfully about income ranges in multivariate plots.

It is also interesting to see how willing Prosper is to loan to people with iffy to very low credit ratings. C is the largest category; D is the second largest category; and E is also pretty large.

I guess the Credit Rating distributions actually correlates pretty well with the Income Range distributions: The bulk of the borrowers were in medium income ranges. They probably don't get the very best loan terms. So the highest credit ratings probably went to the relatively few high income borrowers; and Bs and Cs probably went to the medium-high earners; and Cs and Ds probably went to the medium-low earners. We could investigate this when we do the bivariate graphs.

I don't know what HR means.

Debt to Income ratio.

Ratios above 1 are very rare. I guess those people are in a bad way.

Since they owe more than they make in a year.

It is weird that the largest outlier is like 9 to 10. Are we to believe there are people who owe ten times what they make in a year? And Prosper still gave them loans??

Let's zoom in to 0 to 1 range to see what kind of variation there is there.

See below a log transformation of the same data.

I don't see the advantage between that graph and one that zooms in on the 0 to 1 range and notes that there are a very few extreme outliers.

I don't think using a log transformation will help here.

It isn't just a long tale. It is a giant gap between fractions less than 1 and 10.

I fiddle around with the graph a little below, but the results are less satisfying than zooming in and noting that there are a very few outliers whose debts far exceed their incomes.

The vast majority of the loans are given to people who's debt-to-income ratio is less than 1.

Of those, there are many people with debt ratios from 0 to .1; but almost twice as many with ratios from .1 to .2; it tapers down from this height, with .3 having more loans than .1 but less than .2, and then a pretty quick drop-off.

This makes sense. Most people who need loans are exceeding their income at least somewhat, but you usually can't get a loan if you are exceeding your income by too much. I guess the bulk of lendees are living 20 to 30 percent above their incomes.

But who are these people that owe ten times what they make?

Investigating Debt to Income Ratios programmatically

Almost all the DTIR are less than 1. The relatively few borrowers with debt to income ratios above 1 tend to be people with either no income or very low income. I think the best way to deal with this is to zoom in on the area of interest and note that there are a very few outliers at higher DTIRs, and that these outliers tend to be people with either no or very low income.

Who are these 226 people who have debts 10 times their incomes???

Let's remove the 226 and the one person with like 9.7 times as much debt as income.

And see what we get.

The vast majority of borrowers have a debt-to-income ratio between 0 and 1. And within that range, there is quite a bit of variety. Perhaps by throwing out the few people with much higher debt-to-income ratios (which possibly skews and confuses the trends), DTIR can be a useful metric for predicting loan outcomes.

Amount Delinquent counts

The Amount Delinquent has an insanely long and uniformly low tale.

Most all of the amounts are around $100 or less.

But there are some that are more than $30K

I will try a log transformation.

The vast majority of borrows are \$0. After that, there is a wide distribution of values for Amount Delinquent.

A difficulty is that most Amount Delinquents are zero values, but log transformations can't handle zero values.

See? 15,944 borrowers are \$0 delinquent. 5,490 are to some degree delinquent. Perhaps if we broke this group into Delinquent and Not Delinquent, we would get a decent predictor value for loan completion (ie: maybe the delinquent borrowers are less likely to pay off this loan, too).

Homeownership, state of residency, occupation

Let's look at three categorical values: IsBorrowerHomeowner, BorrowerState, Occupation

About 20% of the lenders owned homes than didn't.

More than twice as many lenders were from CA than from the second highest ranking state (GA). CA is a very populous state. The numbers 3, 4, and 5 states are also very populous: IL, FL, TX. GA has a large population, but smaller than IL, FL, or TX.

I looked up the most populous states. https://www.statista.com/statistics/183497/population-in-the-federal-states-of-the-us/

Top 10 in Millions:

CA (39.7), TX (29.4), FL (21.7), NY (19.3), PA (12.8), IL (12.6), OH (11.7), GA (10.71), NC (10.6), MI (9.97)

Compare with the top 10 most lended-to (by Prosper 2005-2008):

CA, GA, IL, FL, TX, MI, WA, NY, OH, MO

Clearly factors besides the population of the state affect how many of its inhabitants took out a prosper loan, although many of the largest states also took out the most loans.

I don't see how useful this can be. By far the biggest category is "Other" (7296), followed by the generic "Professional" (3085). After that, only Computer Programmer, Sales - Commission and Clerical break 1000.

Credit Score Range

Credit Score Range tends to look fairly bell-shaped. Looks like a more flushed-out version of the Credit Grade distribution.

They seem very loathe to give out loans to anyone with a credit score below like 450.

Stated Monthly Income

I discovered this was easier to work with than Income Range for the correlation scores. If I'd realized that earlier, I would've just done a histogram of SMI and none for IR.

The bulk of the loans went to people making between\$2K and \\$5K per month. The proportions drop off by about 50% when you move down to the \$1K-\\$2K income range or up to the \$5K-\\$7K range. The slope continues steeply down from there.

Let's see how the StatedMonthlyIncome matches up with IncomeRange

\$0 - 576 & \\$1-24,999 - 2617 = About \$0-\\$2K/mo -> 1164 + 2689 = 3853

\$25,000-49,999 - 8008 = About \\$2K-\$4K/mo -> 5680 + 5105 = 1985

\$50,000-74,999 - 5412 = About \\$4K-\$6K/mo -> 4601 + 2624 = 7225

\$75,000-99,999 - 2417 = About \\$6K-\$9K/mo -> 1957 + 1385 + 897 = 4239

\$100,000+ - 2128 = About Everthing over $9K -> 644 + 1609 = 1253

Not displayed - 7741

Not employed - 157

SMI has more entries than the IR because IR has 7741 "Not displayed". The only time where the approximate month-based equivalent to IR seems way off is in the \$100,000+ (instead of more values, SMI has less loans here). But that we can probably attribute to the fact that I left those making up to \\$108K in the previous category (ie: the approximation was too approximate).

SMI seems better than IR since it is a much more complete dataset. If a great deal of people gave false information about how much they make each month, then SMI becomes a poor category. But for this project, let's assume the stated monthly income data is pretty accurate.

When I decided to come up with the ROI metric

Earlier I calculated that the success rate for the loans studied here (loans made from 2005 through 2008), was about 63%. That seems like a terrible failure rate, but I didn't factor in how much money the loans that were eventually defaulted or charged off had already paid back to the lender.

I think a better metric for measuring the success of the loans would be something like: (Amount Lendee paid Lender) - (Amount of Loan + Amount spent by Lendee in quest to recover Loan).

To that end, I will create the following column:

return on investment = (payments.sum - servicefees.sum - collectionfees.sum - loanamount + nonprincipalrecovery.sum + principalrecovery.sum)/loanamount

Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

My focus is on whether or not a loan was successful.

I limited the loans to those begun on or prior to 2008, because I calculated that those loans should've been completed by well before the date that they stopped adding data to the dataframe. That way I can see what percentage of loans that should've been paid off were indeed paid off.

If the focus of the investigation is whether or not a loan should be completed, the most obvious variable of interestis LoanStatus.

However, of the four categories of LoanStatus that occur in the data prior to 2009, two (ChargedOff and Defaulted) seem to me to be clearly unsuccessful, and one (Completed) as clearly successful. One (Cancelled) I am not sure how to assess, and there only five rows in the entire dataset with this outcome. Accordingly, I created a new variable NewLoanStatus with two categories: Successful for Completed and Unsuccessful for ChargedOff and Defaulted.

This is the breakdown for each category:

Completed - 18385 / Chargedoff - 6654 / Defaulted - 4012 / Cancelled - 5

Which results in:

Successful - 18,385 / Unsuccessful - 10664 / Total - 29,056

That's a success rate of 18385/29056 = 63%

I would've thought the success rate would be much higher than that.

How can a lender stay in business with this kind of a success rate?

The graph was of course straight-forward. There was no need for transformations.

But I soon began to doubt that this categorization into Successful and Unsuccessful loans was an adequate metric for analyzing the success of the loans. For one thing, these calculations don't account for how much money was recovered in the "Unsuccessful" loans. Both a loan that paid the principal and a small portion of the interest and a loan that didn't pay anything at all would be labelled "Unsuccessfull", but the former hypothetical loan is much more successful than the latter one.

So I decided to look at return on investment, calculated like this:

loans_to_2008['ROI'] = (loans_to_2008['LP_CustomerPayments'] - loans_to_2008['LoanOriginalAmount'] - loans_to_2008['LP_ServiceFees'] - loans_to_2008['LP_CollectionFees'] + loans_to_2008['LP_NonPrincipalRecoverypayments'])/loans_to_2008['LoanOriginalAmount']

With that metric, I saw that the bulk of the loans had an ROI between 0+ (a touch above 0) and .3,and then tapered down to .75, with only a very few loans doing better than that on the positive side. The negative side had much less values, but they were much more uniform than the positive values. About 250 loans had an ROI of 0, and then the trend is a gradual slope upward until -1, where there are about 500 loans that I guess you could characterize as complete disasters (ie: an ROI of -1, and thus a loss of all the money lent).

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Income Range: Nothing shocking here. By far the most loans went to people making between \$25K and \\$50K; with the \$50K to \\$75K range accounting for about 3/4 as many borrowers as the \$25-\\$50K range, and the two groups of those making more and the one group of those making less than \$25-\\$75k each accounting for about 1/4 as many loans as the \$25-\\$50K group. So people with low to medium incomes accounted for the most loans, which isn't surprising. A lot of people fall into this category and their expenditures often drift beyond their incomes. The graph was straight forward. No transforamtions were needed.

Stated Monthly Income: Seems to be about in line with Income Range, but is a fuller dataset, so -- under the assumption that people's stated monthly income is reasonably close to their actual monthly income -- is the preferable income variable. Due to the very wide range of values, I used a log transformation here and got a narrow distribution that looked fairly normal (for the bulk of the data).

Debt To Income: Values generally between 0 and 1. The best graph was at bin sizes of .02. Right skewed, concave to the right, convex to the left. A fairly long tail, but I didn't see a reason to do a log transformation -- the values (between 0 and 1.5) seemed too circumstribed for that. The highpoint was around .2, and there was a lot more variety to the values greater than .2 than to the values less than .2 A couple hundred borrowers had extremely high debt-to-income ratios. These tended to be people with either no income or a very low one. To use this as a predictor variable, it would perhaps be wise to throw these extremely high debt to incomes out -- since they are a statistically insignificant number of them, and the magnitude of this DTI is so many times greater than the typical DTI (I fear these relatively few extremely high DTIs could obscure the true trends in the data).

Amount Delinquent: I did a log transformation since it had such a long tale. There was a fairly bell-shaped curve after the log transformation, with the bottom values close to zero and the top values around \$100K. HOWEVER, to do a log transformation, I had to skip over those who were $0 delinquent, and they account for 2/3 of the total. Therefore, if I use this log transform in a presentation, I'd have to name it something like "Amount Delinquent (excluding those who owed \$0)" and then prominently explain that they constitute the majority of the cases.

Credit Grade: Straight-forward graphs. At first, I was surprised at how many loans they gave to people with lower credit raitings. C was the most common rating, but D was almost as common as C, and there were a lot of loans given to people with even lower credit ratings. But then I decided that actually correlated well with the income range distribution: the better credit ratings probably went to the relatively few people at the higher income ranges.

Credit Score Range Upper: Looks fairly bell-shaped. Looks like a more detail-rich version of CG.

Borrower State: Pretty straightforward graphs. I was surprised a little by the states that had the most borrowers. California had by far the most. That wasn't too surprising, since it is such a populous state. But the second most borrowers were in Georgia, and then Illinois. Those are large states, but Georgia particularly is nowhere near number 2 in population. I guess this has somethign to do with the states where Prosper had the largest presence during the time period we're analysing.

Occupation: I don't know how useful this one can be. By far the largest category is "Other" \$7K+), and then the very vague "Professional" (\\$3K+). Then there are only three with at least \$1K, and one of them is the still very vague "Clerical".

Home Owner: Straight forward. There's about 20% people who don't own homes than who do.

Bivariate Exploration

Here we will investigate the relationships between the two possible loan-succes metrics (ROI and NewLoanStatus) and the several possible predictor variables (AmountDelinquent, DebtToIncomeRatio, StatedMonthlyIncome, the several credit grade/score variables, BorrowState, Occupation, and IsBorrowerHomeOwner).

As previously mentioned, we will restrict our investigations to predictor variables about the borrowers (so we won't include variables like APR, or other details of the loans). We are trying to see how characteristics of the lenders make the loans more or less likely to succeed.

First I will organize the predictor variables by variable type -- if for no other reason than to help myself think about them more systematically.

Debt to Income Ratio vs ROI

This tells me nothing. Most of the debt to income ratio's are less than one; and the most of those ROIs ranges from -1 to .75. The DTIR bump out past 1 from 0 to .5 and taper inward after that, shutting off around .75. It is a little surprising that higher ROIs seem to be correlated with slightly higher DTIR.

I will adjust the plot-points to see if I can see more info.

Still pretty much a tower. The vast majority of the DTIR is between 0 and .75. You can see a little bulge outward (higher DtIs) as the ROIs start becoming postitive. That bump peaks at about 1 DTIR with a ROI of about .25, and then gradually sinks back to a DTI of about .75 at an ROI of .5, and holds steady with the DTIR though about .7 ROI

And everything else is an outlier.

I'll try once more, adding jitter.

Still not very interesting. Almost all DTIR values are within the same range, and they lead to the same results. It doesn't look like DTIR is a very good predictor value.

Let's run it through a correlation function.

-.034 is not a very significant correlation.

I'll just check the pearson score after removing the outliers real quick and then stop investigating this variable.

Still not a meaningful correlation.

OK, so even if we remove the outliers, DTI doesn't seem to be a very good predictor of ROI.

A little more on DTIR and ROI -- please skip down to AmountDelinquent.

Amount Delinquent vs ROI

This also doesn't look like a very useful metric. I'll run it through the correlation function real quick.

This isn't a significant correlation. A small pearson correlation is from |.1| to |.3|

This isn't even very close to |.1|

Let's just see what happens if we remove the outliers.

Still a very low correlation score.

Ordinal Variables: Credit Grade & Income Range

Now I'll look at the two ordinal variables. CreditGrade & IncomeRange

First I'll see how they relate to each other.

Then I'll compare them to ROI.

I think this would be more meaningful if I had the percent of each income range in each credit rating.

I'll try a barplot. But that won't work if both variables are categorical. I'll try StatedMonthlyIncome instead.

That makes sense. For AA, the incomes are quite a bit higher than the rest. Then from A to B and C there is a gradual slope, dropping off more for C to D. And then for some reason a little bit up for E.

I should check to see how StatedMonthlyIncome relates to IncomeRange.

Stated Monthly Income versus Income Range

Correlated very linearly up through the \$ 75– \\$99,9999 range.

The more drastic slope from that category up to the \$100K+ category is also predictable -- because of the + at the end of the \\$100.

So we should be able to use IncomeRange and StatedMonthlyIncome interchangeably.

Let's compare ROI with these CreditRange, IncomeRange, and StatedMonthlyIncome.

CreditGrade vs ROI and IncomeRange vs ROI

What does it mean that median ROI goes slightly up from AA through D?

It makes sense that the bottom 25% of ROIs should be longer and longer (starting with lower and lower values) as the credit grades decrease: more and more loans are performing poorly.

Why does the top 25% of ROIs tend to be concentrated in a small range of values in all grades from AA through D?

What does it mean that median ROI is pretty steady across all income levels, with only slight dips registering for $0 and Not Employed?

Roughly, the bottom 25% of values increases as we move away from the \$75K-\\$99K income range (for 2008 a high but not very high income).

Excepting "Not Employed", all income ranges have lower quartiles significantly larger than their upper quartiles.

On average, AA and (to a lesser extent) D showed slight profits. A seems to have made \$0. B and C lost a little money on average. HR (what is that?) and No Credit Grade both show a substantial average loss.

These outcomes seem quite poor. The most loans given were Cs, followed closely by Ds, and then Bs. Cs have a very slight negative average, but with error bars into the positive range; Ds have a slight positive average, but with error bars into the negative range. The general picture is of profits hovering around and generally dipping below zero. Only AA had a solidly positive ROI (with error bars above 0); but there were as many HR as AA loans given out, and HR loans have an ROI of close to -20 percent. Does this graph reflect difficult years from 2005 through 2008? Due to the 2007-2008 financial crisis, and their pre-2009 practices?

Here is the number of loans at each category during the 2005-2008 period:

C 5644 / D 5151 / B 4386 / HR 3508 / AA 3508 / A 3313 / E 3289 / NC 141

I found this from Wikipedia:

" ... From 2006 to 2009 Prosper operated a variable rate model. Prosper acted as an eBay-style online auction marketplace, with lenders and borrowers ultimately determining loan rates using a Dutch auction-like system. ... On November 24, 2008, the SEC found Prosper to be in violation of the Securities Act of 1933. As a result of these findings, the SEC imposed a cease and desist order on Prosper. Due primarily to the novel nature of the peer-to-peer lending models, the SEC, after review, now treats all peer-to-peer lending transactions as sales of securities and requires that all platforms register with the SEC. ... As of August 2008, approximately 18.5% of all money loaned through Prosper from inception (February 2006) through June 2008 were in some form of delinquency. Also, more than 35% of all loans that originated in February 2007 were in some form of delinquency. ... After Prosper's relaunch in July 2009, and implementing stricter credit guidelines for borrowers[20] Prosper's loan default rate has been significantly reduced. ... "

If we are considering Credit Grade as a predictor variable for ROI, the downward slope from AA through B is encouraging, but then the ROI increase slightly from B to C and from C to D. If CG was to be a good predictor variable for ROI, we'd expect the average ROI values to decrease or increase steadily in one direction as credit scores increase or decrease.

On average, loans made to people of all income ranges lost money, with two exceptions: \$50K-\\$74,999 had a slight increase in average ROI, and \$75K-\\$99,999 had a net ROI of about $0. As with Credit Grade and ROI, it is difficult to discern much of a pattern when trying to use ROI as a function of Income Range.

Let's look at CreditGrade and Income Ranges vs ROI on violin plots.

As credit grade goes down, the tendency is to get more variety in the outcomes.

I'm guessing that this tendency, as well as the larger bottom quartiles shown in the box plots, can be attributed to the higher level of risk in loans given to people with lower credit scores. More loans fail, and there are also more spectacular failures (ie: the lender recovered very little of the lent money), but those loans that succeed tend to have higher interest rates (since people with lower credit scores are charged more interest), making for higher ROIs for the successful loans.

Both the violin plots and the box plots showed definite trends as Credit Grade increased. What does this mean? Credit Grade doesn't seem to be a very good predictor of average ROI success (at least not in this time period), but it seems to be a pretty good predictor of the overal statistical spread of the outcomes. I think maybe this is because the increasing medians and lengthening distribution bands offset each other, thus preventing the trends we can see in the violin and box plots from being seen in the average ROI scores (which is what the bar plot displayed).

As mentioned, I can understand why the distribution bands should lenghthen (or, as in the case of the barplots, why the bottom quartile should get longer). But why should the median ROIs increase as the credit grades increase?

There doesn't seem to be much difference in ROI at different income ranges. The 75%-of-values distributions do get a little wider at the higher ranges -- suggesting that more loans are closer to the median (a symptom of less risk loans: more loans turn out the way you predicted they would, which I guess is around a 10% return (since all the means are around 10% and I imagine that has to do with charging higher interest rates to riskier loans)).

Why don't the Credit Rating and Income Range violin plots look more alike? I guess because with more data than just a borrower's income range, Prosper makes better guesses about the loan outcomes and is better at predicting who they should lend to and how much they should charge them -- this makes the statistically richer plots (like box and violin plots) correspond better to Credit Grade than Income Range, even in rough times.

Before proceeding, I'd like to see how my two possible primary metric -- ROI and Succesful/Unsuccessful relate to each other.

NewLoanStatus vs ROI

This seems to be a fairly significant correlation. On average, the Successful loans have a 20% return and the Unsuccessful ones have a -40% return. Maybe my original primary metric was OK, after all. But ROI still seems better, as it gives more details about the loan outcomes.

CreditGrade vs NewLoanStatus & IncomeRange vs NewLoanStatus

For Credit Grade, the Unsuccessful loans increase pretty linearly from AA to A to B to C. The AA Successful loans are quite a bit higher than the A; but the Successful loans progress pretty linearly from A to B to C.

Similar trends can be found in the medium three income ranges.

I think this data would be more useful if the Successful and Unsuccesful loans were measured as percentages of the total loans for each Credit Grade or Income Range.

Making the NewLoanStatus variable numeric and using it to compare the percent of successful or unsuccessful loans at different Credit Grades

That shows a fairly clear correlation.

I'll try the same with Income Ranges

60% success rate for \$0 income. Something like 62% for the newt two ranges. And then up to maybe 65% for both the \\$50K-\$74K and the \\$75K-\$99K ranges. And then up another percentage point or so for the highest income range. That's a trend, although not as strong as I would've predicted. And why would the unemployed be the best at paying off their loans??

This also does not agree with the previous map of Gredit Grade vs ROI.

StatedMonthlyIncome vs ROI

There's some crazy outliers. First I'll see what happens if we zoom in to the bulk of the data.

Past 25,000 there are proportionately very few values.

Let's see what happens if we zoom in further.

Zooming in seems to make the trend I initially noticed -- a little downward slope in ROIs as incomes go up -- disappear. Let's try a log tansformation.

There seems to be a trend -- the ROIs increase as we go up from a couple hundred to thousands of dollars a month; and then start sloping down again a little before \$10K a month.

Let's see what Pearson says.

According to Pearson, Stated Montly Income is not a good predictor of ROI.

What would happen if we do a Pearson evaluation on the section with most of the data -- from say \$300 to \\$30,000 a month?

Still not a significant correlation. What if I honed into more typical incomes, say \$1K to \\$10K per month?

I am disappointed that so far of the possible predictor variables I studied, only Credit Grade is looking like a very good predictor. So I will add in the CreditScoreRangeUpper, which I believe must be correlated with outcomes. (CreditScoreRangeLower should be too. I have no particular reason for choosing Upper over Lower. When I run .corr() tests on a predictor value versus first one and then the other, I get identical scores; so I assume they are for our purposes here essentially interchangeable.)

The middle Credit Ranges show mixed results; the lower scores are uniformly negative average ROIs. Only up around upper credit ranges of about 800 does Prosper start to turn a solid profit. Below the 579 credit rating even their average ROI is consistently a loss of 10% or more.

This doesn't seem to fit with our investigations of Credit Grade and ROI -- even the lowest Credit Grades had positive ROIs.

Let's run the correlation tests.

What happens if we remove those few low scores?

Let's go over Credit Score Range versus Credit Grade and versus StatedMonthlyIncome

Credit Grade vs Credit Score Range (Upper & Lower)

The lowest credit score range is NC, then there's several HR. 540 is the only range with two credit grades (HR and E).

In general, the lower Credit Grades are in the lower score ranges and the higher CGs are in the higher CSRs -- as you'd think would be the case.

This also explains the apparent discrepancy between our ROI versus CG and our our ROI versus CSRU findings -- the reason that CreditGrade has uniformly positive ROIs is that the lowest score E correlates to a CSRU of at least 540. The really low CSRU scores (with the uniformly negative ROIs) don't even get letter rating.

Credit Score Range versus Stated Monthly Income

There looks like a slight but meaningful relationship. As income goes up, Credit Range goes up too.

Spearman correlations with categorical variables

See the bottom of this section for a chart with spearman and pearson scores greater than |.1|

Let's run computational correlation tests on some categorical variables versus ROI (numeric) and Successful/Unsuccessful (categoric).

Spearman is interpreted like this:

.00-.19 “very weak” / .20-.39 “weak” / .40-.59 “moderate” / .60-.79 “strong” .80-1.0 “very strong”

So this is a very weak correlation.

Let's just see what happens if we try a pearson.

OK, well, that wasn't supposed to work anyway.

I'm not having much luck finding a good predictor.

Credit Grade should've been a very good predictor.

If I can't get a meaningful correlation here, I think I have to conclude there is something wrong with my ROI (return on investment) metric.

Let's just see if we can do any better with CreditScoreRange. It's sliced up more finely than Credit Grade, so if ROI is at least on the right track, CSR should do better than the .14 that CG gave us.

I didn't think the spearman would work. But I thought the pearsons would've shown a clear correlation between my return on investment metric and CreditScoreRangeLower (which is a numeric value, and which should thus be able to line up linearly with roi, another numeric value).

Here again I was sorely disappointmed.

.1 is the bottom cut-off for a low pearson's correlation.

So .12 is definitely low. So low as to be meaningless? Probably. A medium correlation starts at .3

I'll just look at the Upper to see if it's any different.

CreditScoreRangeUpper and CreditScoreRangeLower have identical correlation scores with ROI. From this, and their obvious similarity to each other, I conclude that at least for my purposes here, CSRU and CSRL are functionally equivalent.

A .12 pearson score represents a weak correlation. And this .12 ROI/CSRU score is the highest score I've gotten when comparing ROI with any of the predictor values.

This makes me feel like my ROI is pretty useless. Surely the credit score ranges are a good predictor value for a loan's return. They are made by the lenders for the sole purpose of turning a profit and lenders don't generally go broke. The housing collapse wasn't until 2008. Most of the loans in this data set would've been up before then. So the aggregate of the loans should've turned out about as the credit scores predicted.

It seems that my ROI must be somehow wanting. Or perhaps the problems plaguing both Prosper and the larger economy during the time period under review are to blame in the apparent mismatch between credit scores and loan outcomes.

I'll see if my original metric -- Successful/Unsuccessful based on whether or not the loan was completed -- does any better.

OK, so at least my two metrics for measuring a loan's success correlate with each other.

Now let's find out if NewLoanStatus or ROI more closely correlates to the would-be predictors.

Try the same, but throw out the most extreme values.

Hmm, still no very meaningful correlation between Stated Monthly Income and the Successful/Unsuccessful metric.

Let's try honing in on a very typical income range.

OK, that didn't help.

Checking CG versus NLS

.2 is the very beginning of the 'weak' correlation range.

But the graphs earlier had shown that as the credit grades went down, the percent of completed loans also went down. Oh! Maybe I should've given the highest credit grades the highest numbers. That makes sense. If that's the case, and for AA to get the highest rank it needs the highest number, then I numbered the categories backwards, and I can conclude that this spearman result roughly matched my graph above: a very slight positive correlation.

Finally, let's see how CreditScoreRangeUpper compares to NLSNumeric.

Huh. What can we say but the fancy ROI metric didn't do nearly as well the simple Completed/Not-Completed metric. Clearly, the ROI metric is missing something. There must be some important part of the return on investment calculation that did not make it into my ROI calcultion. Either that, or I don't know what. Since return on investment really should correlate to credit score range and the other predictor variables.

Did I ever compare CreditScoreRangeUpper with Stated Monthly Income?

That doesn't make sense. CSRU correlates weakly but still meaningfully to both Stated Monthly Income and NewLoanStatus (.24 and .27 spearman scores, respectively); but Stated Monthly Income and NewLoanStatus have no meaningful correlation (.05 spearman score). How can that be?

Credit Score Range and Credit Grade should be closely correlated. Let's run a spearman on them. If that doesn't come out well, we know there is something wrong with the way we're doing computational correlations.

OK, that's reassuring. As noted above, with the way we created CreditGradeNumeric, a negative spearman score represents a positive correlation (because we gave the highest grades the lowest numbers). .98 is a very high correlation, and that is what we'd expect from these two variables (CG is basically just a less-detailed version of the credit score ranges).

For completeness, I'll run spearman tests on NewLoanStatus versus both DebtToIncomeRatio and AmountDelinquent.

Debt to Income wasn't even a |.1| correlation, but Amount Delinquent was |.15| -- a very weak corelation (but still a correlation).

Overview of .corr() results for outcomes greater than |.1|

Variables Result Type of Test
ROI / Credit Grade .138815 spearman
ROI / Credit Score Range .121623 pearson
NLS / Credit Grade .201647 spearman
NLS / Credit Score Range .265645 spearman
NLS / Amount Delinquent - .150028 spearman
ROI / NLS .722029 spearman
Stated Monthly Income / Credit Score Range .240603 spearman
Stated Monthly Income / Credit Score Range .198644 pearson
Credit Grade / Credit Score Range .98771 spearman

NLS = NewLoanStatus = Successful/Unsuccessful based on whether or not the loan was completed.

ROI = Return on Investment (see calculation in the Prepare Data for Analysis section at the top of this notebook).

Whenever Credit Grade was one of the variables, the results here have been multiplied by -1 (because I created the CreditGradeNumeric variable backwards)

Spearman is interpreted like this:

.00-.19 “very weak" / .20-.39 “weak” / .40-.59 “moderate” / .60-.79 “strong” .80-1.0 “very strong”

Pearson is interpreted like this:

.1 - .3 "small" / .3 to .5 "medium" / .5 to 1 "large"

(Maybe both should be interpreted like this?)

Discussion of .corr() score results can be found in the intro to the multivariate plots section

ROI versus Borrower State, Occupation, Home Owner Status and Employment Status

There's two more categories to compare against ROI:

nominal_vars = ('BorrowerState', 'Occupation')

binary_vars = ('IsBorrowerHomeowner', 'EmploymentStatus')

I didn't spend too much time on these or notice any very interesting trends. Feel free to skip down to the multivariate section.

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

The ROI versus primary variable results were disappointing and confusing.

When graphed, the StatedMonthlyIncome looked like it correlated fairly well to ROI, but the pearson score was very low (.02 for all the data; .03 when we zoomed in on the bulk of the data).

Of the many variables describing the borrows that I examined, only CreditScore and CreditScoreRange seem to be good indicators of a loan's ROI score.

But why did I get such a trivial ROI/SMI score when the graph of ROI/SMI seemed to show a small but meaningful relationship, and when ROI/NLS, NLS/SMI, ROI/CSRU, and CSRU/SMI all had meaningful-looking .corr() correlation results?

The NewLoanStatus primary variable results were less disappointing, but the metric itself seems less meaningful because it doesn't account for as many variables as ROI. For example: NewLoanStatus only tells us whether or not a loan was completed, but sometimes an uncompleted loan still makes money,and the amount of money an uncompleted loan loses varies widely; that's why I calculated the return on investment variable in the first place.

NewLoanStatus had higher correlations with the credit-based variables (CreditGrade, CreditScoreRangeUpper and CreditScoreRangeLower) than ROI did. It also had a correlation with AmountDelinquent above the |.1| weak-correlation threshold. The ROI/AD correlation scores were well below |.1|

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Stated Monthly Income and Income Ranges were closely related -- as you'd think they should be. There was also a discernible relationship between Credit Score and Income Ranges, and between CreditScoreRange and CreditGrade. All as you'd expect.

Multivariate Exploration

To help choose the which multivariate plots to investigate, let's review our .corr() results.

Variables Result Type of Test
ROI / Credit Grade .138815 spearman
ROI / Credit Score Range .121623 pearson
ROI / Credit Score Range -.043215 spearman
NLS / Credit Grade .201647 spearman
NLS / Credit Score Range .265645 spearman
NLS / Amount Delinquent - .150028 spearman
ROI / NLS .722029 spearman
Stated Monthly Income / Credit Score Range .240603 spearman
Stated Monthly Income / Credit Score Range .198644 pearson
Credit Grade / Credit Score Range .98771 spearman

NLS = NewLoanStatus = Successful/Unsuccessful based on whether or not the loan was completed.

ROI = Return on Investment (see calculation in the Prepare Data for Analysis section at the top of this notebook).

Whenever Credit Grade was one of the variables, the results here have been multiplied by -1 (because I created the CreditGradeNumeric variable backwards)

Spearman is interpreted like this:

.00-.19 “very weak" / .20-.39 “weak” / .40-.59 “moderate” / .60-.79 “strong” .80-1.0 “very strong”

Pearson is interpreted like this:

.1 - .3 "small" / .3 to .5 "medium" / .5 to 1 "large"

The highest non-trivial correlations found with the Pearson and Spearman tests were ROI/NLS (spearman .723); NLS/Credit Score Range (spearman .266); Stated Monthly Income / Credit Score Range (spearman .241; pearson .198644); and NLS/Credit Grade (spearman .201647).

We could perhaps also add ROI/Credit Score Range (pearson .121623 -- that's a small correlation, but it is the best correlation that I found with ROI and any of the predictor variables.

I'll focus my multivariate explorations on these variables.

I don't understand why comparing ROI with the various predictor variables resulted in pearson scores so much lower than the spearman scores I got when comparing NLS to those same predictor variables.

Should pearson be weighted higher? Note that SMI/CSR got a Pearson score of .1986 and a Spearman score of .2406 That makes one think that the two methods are interchangeable when both variables are numeric, and that the Pearson's score should be weighted higher than Spearman's. However, then we have the case of ROI/CSR -- two numeric variables, graphs show a correlation, Pearson has a small but perhaps meaningful score of .12, but then Spearman comes in with a -.04! What are we to make of that? That's 1/3 the size of the Spearman (in the SMI/CSR example, the Spearman was .83 times as big as the Pearson).

It would be nice if the multivariate plots could give me some insight into why NLS got higher .corr() scores than ROI.

Of particular interest: Why did I get such a trivial ROI/SMI score when the graph of ROI/SMI seemed to show a small but meaningful relationship, and when ROI/NLS, NLS/SMI, ROI/CSRU, and CSRU/SMI all had meaningful-looking .corr() correlation results?

I don't really know how to approach these questions. I will try plots that combine the variables with the highest .corr() scores (ROI/NLS/SMI/CSR) and see what I get.

Here we see what appear to be clear trends. The Unsuccessful loans are much more likely to lose money. They have almost uniformly positive ROIs. Strangely, as the Stated Monthly Income increases, the ROI decreases, and there are less total successful loans.

How can I show what is going on behind the the orange part of the graph? I guess I could take a sample to reduce the number of plots.

We also have to do a log transformation on StatedMonthlyIncome.

We learned in the bivariate section that without a log transformation, StatedMonthlyIncome results in a halved graph.

There's a lot of empty space. Let's zoom in on the more typical income levels.

Very few of the Unsuccessful loans have positive ROIs. But a few do. These loans must've been paid off for a long time before the borrower finally quit.

The Successful loans also seem to have a wider monthly income range than the Unsuccessful loans.

When the ROI is between 0 and about .15, the incomes of the borrowers of the completed/Successful loans range from over \$10K down to less than \\$1K. That income range becomes narrower as the ROIs increase. The most succesful ROIs (from about .6 through about .7) seem to go to people making from about \$2K to \\$6K per month.

From -1 to 0, the Unsuccessful loans's borrowers tend to have incomes from about \$1K to \\$10K. As the ROIs get larger than 0, the Unsuccessful income range seems to narrow a bit, but there is no clear pattern.

By showing a definite trend in incomes as ROI increases, this graph (like the earlier bivariate ROI/SMI graph) seems to contradict the extremely low pearson score of ROI/SMI. But notice that the Unsuccessful loans have much more uniform incomes across the ROIs. Maybe the existence of a large uniform negative SMI/ROI relationship is obscuring the meaningful relationship between ROI and SMI for those loans that are successful?

Yes!, the Successful loans do indeed have a correlation about |.1|. But the correlation has gone from being a positive one (in loans_to_2008) to a negative one (in the Successful Loans subset).

ROI as a function of CSRU, with color-coded StatedMonthlyIncome

The existence of extreme salary outliers means that all the standard salaries are scrunched down at the bottom in the light greens. Let's see what happens when we remove the outliers.

Positive ROI slopes up from CSRUs of about 450 to an apex around CSRU 600 and then down again until CSRU 900. Negative ROIs are pretty much a solid block from about CSRU 450 to CSRU 850.

The graph gets a little darker (meaning incomes are going up) as you move towards higher credit ratings.

There are dark plots on both positive and negative sides of the ROI, so it is hard to see much effect of income on ROI.

ROI as a function of SMI, with color-coded CSRU

Zoom in:

Is this graph easier or harder to read than the first one?

Note that the values with negative ROIs have fairly uniform SMIs, and that the SMIs of the positive ROIs correlate much more meaningfully to changes in ROI. This, again, is probably related to the fact that higher incomes tend to have slightly lower ROIs.

Note also that the most discernable darkening (indicating hightest credit ratings) is along the bottom of the positive section of ROI -- this we've seen in other graphs: the highest CSRUs correspond to lower ROIs.

ROI as a function of SMI, with CSRU color-coded -- as two graphs: Successful vs Unsuccessful

Let's combine the two above sets of multivariate explorations into two side-by side graphs that compare ROI, SMI and CSRU in two cases: Successful or Unsuccesful (ie: completed or uncompleted) loans.

Both graphs increase the ROI as SMI goes up to about \$6K, and then the ROIs slope down. But both the upward and downward slope is much more gradual on the Successful loans. Also, Unsuccessful loans represent pretty much all ROIs (even ones with very high ROIs!), but Successful loans have pretty much only positive ROIs.

Note that the darkest bands in the Successful graph are between ROIs of 0 and .1, and that both graphs tend to get darker as you move to the right. The first phenomenon we've previously attributed to Prosper giving people with better credit scores better rates (meaning Prosper makes less money on those Successful loans). The second phenomenon we should be able to attribute to an increase in credit scores as incomes go up. I'm only surprised that that latter trend is only perceived very slightly in these graphs.

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

ROI as a function of SMI, with color-coded Successful/Unsuccessful loan status:

Very few of the Unsuccessful loans have positive ROIs. But a few do. These loans must've been paid off for a long time before the borrower finally quit.

The Successful loans also seem to have a wider monthly income range than the Unsuccessful loans.

By showing a definite trend in incomes as ROI increases, this graph (like the earlier bivariate ROI/SMI graph) seemed to contradict the extremely low pearson score of ROI/SMI.

But with the multivariate graph, I noticed that the Unsuccessful loans have much more uniform incomes across the ROIs and speculated that perhaps the existence of a large uniform negative SMI/ROI relationship was obscuring the meaningful relationship between ROI and SMI.

I removed the Unsuccessful loans from the data set, ran a Pearson test, and found a slight negative (about -.1) correlation between ROI and SMI in the Successful loans.

ROI as a function of CSRU, with color-coded StatedMonthlyIncome

Positive ROI slopes up from CSRUs of about 450 to an apex of around CSRU 600; the postive ROIs then slope down again until CSRU 900. Negative ROIs are pretty much a solid block from about CSRU 450 to CSRU 850.

The graph gets a little darker (meaning incomes are going up) as you move towards higher credit ratings.

There are dark plots on both positive and negative sides of the ROI, so it is hard to see much affect of income on ROI.

ROI as a function of SMI, with color-coded CSRU

The values with negative ROIs have fairly uniform SMIs, and the SMIs of the positive ROIs correlate much more meaningfully to changes in ROI. For positive ROIs, the ROIs go up from incomes of about \$1K/month until peaking at about ROI .7 with incomes around \\$6K/month, and then slope down again.

The most discernable darkening (indicating hightest credit ratings) is along the bottom of the positive section of ROI -- this we've seen in other graphs: the highest CSRUs correspond to lower ROIs.

ROI as a function of SMI, with CSRU color-coded -- as two graphs: Successful vs Unsuccessful

In both the Successful and Unsuccessful graphs, the ROIs increase as SMI goes up to about $6K; and then the ROIs slope down. But both the upward and downward slope is much more gradual on the Successful loans.

Also, Unsuccessful loans represent pretty much all ROIs (even ones with very high ROIs!), but (of course) Successful loans have pretty much only positive ROIs (it makes sense that only very few uncompleted loans result in a profit for the lender).

Why do the Successful and Unsuccessful loans both see increases in their ROIs as incomes go up to about \$6K, and then slope down? And why are both the upward and downward slopes much more gradual (starting at lower and ending at higher incomes) in the graph of Successful loans? And note that that trend in the Unsuccessful loans holds not just in the positive portion (the "successful" portion of the Unsuccessful loans), but also in the negative ("unsuccessful" portion).

Does the appearance of a pattern in the isolated Unsuccessful Loans ROI versus SMI plot contradict our earlier hypothesis that the large number of relatively uniform negative-ROI values (ie: a rectangle with no discernable pattern between ROI and SMI) obscured the ROI/SMI correlation that is visible in the postive-ROI portion of the graph? (Recall that the Successful loans have a ROI/SMI pearson score above |.1|; but all the loans taken together didn't had ROI/SMI pearson score much below |.1|.)

I will run the pearson correlation on the Unsuccessful loans.

OK, so that's still a pretty meaningless correlation.

But it gave me the idea to see if the pearson correlations increased if I selected the income ranges from about \$300 through \\$6000 -- where they are increasing on the graph.

I don't think it should, because the pearson is supposed to track linear correlations, so it should be able to account for lines going first up and then down. But I just thought I'd see what happens.

Nope. No meaningful correlation.

Let's see what happens if we isolate just the successful loans.

That's weird. Still an insignificant correlation, but now in the negative direction. But in the first half of this graph in the Successful Loans subset, the ROIs increase as SMIs increase.

I can't explain this.

There is a clear positive linear relationship between SMI and ROI in the Successful Loan subset within the \$300 through \\$6000 income ranges. So why am I getting a tiny and negative pearson score over that same subset and range?

About the Color-Coded CSRU - how it evolved as SMI and ROI increased.

Note that the darkest bands in the Successful graph are between ROIs of 0 and .1, and that both graphs tend to get darker as you move to the right. The first phenomenon we've previously attributed to Prosper giving people with better credit scores better rates (meaning Prosper makes less money on those Successful loans). The second phenomenon we should be able to attribute to an increase in credit scores as incomes go up. I'm only surprised that that latter trend is only perceived very slightly in these graphs.

Variables Result Type of Test
ROI / Credit Grade .138815 spearman
ROI / Credit Score Range .121623 pearson
ROI / Credit Score Range -.043215 spearman
NLS / Credit Grade .201647 spearman
NLS / Credit Score Range .265645 spearman
NLS / Amount Delinquent - .150028 spearman
ROI / NLS .722029 spearman
Stated Monthly Income / Credit Score Range .240603 spearman
Stated Monthly Income / Credit Score Range .198644 pearson
Credit Grade / Credit Score Range .98771 spearman

NLS = NewLoanStatus = Successful/Unsuccessful based on whether or not the loan was completed.

ROI = Return on Investment (see calculation in the Prepare Data for Analysis section at the top of this notebook).

Whenever Credit Grade was one of the variables, the results here have been multiplied by -1 (because I created the CreditGradeNumeric variable backwards)

Spearman is interpreted like this:

.00-.19 “very weak" / .20-.39 “weak” / .40-.59 “moderate” / .60-.79 “strong” .80-1.0 “very strong”

Pearson is interpreted like this:

.1 - .3 "small" / .3 to .5 "medium" / .5 to 1 "large"

Were there any interesting or surprising interactions between features?

Bivariate observations:

  1. Credit Score Range (Upper) and (Lower) seem functionally equivalent, as far as their relationships to ROI and NLS go.

  2. Graphically, Credit Score Range is correlated with gradually decreasing peak-ROIs from its peak/start at CSRU 600 / ROI .75 down through the low/end of CSRU 900 / ROI .1 I attribute this to the fact that people with higher credit scores receive better lending rates.

  3. Graphically, Credit Score Range is correlated with clearly negative ROIs for CSRs less than 600, and for clearly positive ROIs for CSRs greater than 800. But between CSR 600 and 800 no clear pattern is discernable: the ROIs are all near 0, but there's no visible direction (ie: the ROIs don't generally get larger or smaller as the CSRs get larger, but instead the ROIs fluctuate a little above or below 0). From this, it seems that Prosper did a poor job of off-setting the higher risks inherent in lending to people with worse credit track records during the time frame we investigated. To be profitable when lending to people with lower credit scores, they would've had to be more careful about who they loaned to and/or set higher rates for those people.

Here we should note that Prosper Marketplace is a peer-to-peer lending marketplace. They match borrowers with lenders. The period we're investigating is 2005 through 2008. From 2006 through 2008, Prosper let borrowers and lenders set their own rates. During this period, a relatively high percentage of Prosper's loans resulted in default. In November 2008 the SEC found Prosper in violation of the Securities Act of 1933 and decided that from now on all peer-to-peer loans would be classified as securities and all peer-to-peer lending marketplaces would have to register with the SEC. Starting in 2009, Prosper began complying with the new SEC oversight and also started setting the lending rates (so now lenders and borrowers just had to decide whether or not they would accept the rates set by Prosper's algorithms). The results were that many less loans were in default.

The time frame we investigated also coincides pretty closely with the 2007-2008 financial crisis.

Prosper's lending practices and the global economic crisis can probably at least partically account for the relatively poor predictive value of CSRU -- the best predictor variable I found. This, anyway, is my ver non-expert guess.

  1. Mathematically, the Pearson CSRU versus ROI score was .12 -- a slight positive correlation. But spearman was -.02, a miniscule negative correlation. Observationally, there is a slight positive linear correlation. Why should spearman result in a negative correlation?

  2. It is not surprising that the two Credit Score Ranges correlated closely both visually and mathematically to Credit Score.

  3. It was encouraging that there was a high correlation both visually and mathematically between ROI and NLS (New Loan Status: a variable measuring whether or not a loan was completed successfully). Otherwise, one or both of the primary metrics would've been very suspect.

  4. The best predictor-primary pairs I found were NLS/SMI (); NLS/CSRU (spearman .27) and NLS/CreditGrade (spearman .20). For ROI, the best predictor-primary pair was ROI/CG (spearman .14), followed by ROI/CG (pearson .12; spearman -.04)

  5. The only other predictor-primary pair with a score above .1 was NLS/AmountDelinquent (spearman -.15)

  6. The best predictor/predictor pair I found was CSRU/CG (spearman .99); but since I assume Credit Grade is made from Credit Score Range, this result was pretty trivial. The second best predictor/predictor score I found was SMI/CSRU CSRU/StatedMonthlyIncome (spearman .24; pearson .20).

  7. An abiding mystery: Why were the SMI/ROI and SMI/NLS correlation scores so low? SMI related meaningfully to CSRU and CSRU related meaningfully to both ROI and NLS. But the SMI/NLS and SMI/ROI scores were very low (SMI/NLS: spearman .04; pearson .06) (SMI/ROI: pearson .02; spearman .003).

  8. Another question: Why do the medians in the ROI versus Credit Range box plots increase slightly as we go from AA through D? Even as the average ROIs show no clear trend, beyond the relative success of AA loans compared to the relative failure of the other loans.

  9. Another question: Why do the medians in the ROI versus Income Range box plots hold pretty steady across all income ranges?

Multivariate observations:

  1. Why are there Unsuccessful loans at pretty much any ROI (even very high ones like .75)? The Unsuccessful loans either defaulted or were charged off. How could there be any defaulted or charged-off loans with peak ROIs (.75 is about as high as the Successful loan ROIs ever get)?

  2. All the correlation tests for SMI and ROI resulted in scores significantly below |.1| when we examined loans_to_2008 -- even when we removed outlier values. But when I separated out the Successful loans (almost all of which had positive ROIs) from the Unsuccessful ones (the vast majority of which had negative ROIs), I found that within the Successful loans subcategory, the ROI/SMI score was -.1, but within the Unsuccessful subcategory, the score was .03. What can we conclude from this?

  3. Why do the Successful and Unsuccessful loans both see increases in their ROIs as incomes go up to about \$6K, and then slope down? And why are both the upward and downward slopes much more gradual (starting at lower and ending at higher incomes) in the graph of Successful loans? And note that that trend in the Unsuccessful loans holds not just in the positive portion (the "successful" portion of the Unsuccessful loans), but also in the negative ("unsuccessful" portion).

A final Refinement

I decided to remove the points outside of the main shape (income ranges over \$30K and less than \\$300), and then do another analysis on ROI versus SMI, ROI versus CSRU, and SMI versus CSRU in the Successful Loans and Unsuccesful Loans subsets.

Since the Successful/Unsuccessful subcategorizing had yielded encouraging results with the full data set, I thought that doing the same thing but without the most extreme outliers would give me the best chance of finding significant correlations without materially altering the data.

I began by redoing the graph, and including lines that show the means of the x (SMI) and y (ROI) variables.

The blue lines are the means within typical income ranges.

If the top left and bottom right quadrants have the most observations, there is a negative linear trend. That's what seems to be happening in the Successful graph.

If the bottom left and top right quadrants have the most observations, there is a positive linear trend.

You can skip past the below calculations to a chart with the results of the various correlation tests.

Pearson and Spearman Results

First a review of the results from the full data set:

Variables Pearson Spearman
ROI / Credit Grade NA .138815
ROI / Credit Score Range .121623 -.043215
NLS / Credit Grade NA .201647
NLS / Credit Score Range NA 265645
NLS / Amount Delinquent NA -.150028
ROI / NLS NA .722029
Stated Monthly Income / Credit Score Range .198644 .240603
Credit Grade / Credit Score Range NA .98771

When I removed all Stated Monthly Incomes under \$300 or over \\$30K for the Successful Loans and Unsuccessful Loans subsets, the correlation scores got a little better than in the initial full data set.

The pared-down data set (called "typical_incomes", even though \$30K/month isn't very typical) also got larger correlation scores when broken into Successful and Unsuccessful loans than those subsets had received with the full data set (the full data set I studied here: loans originating from 2005 through 2008).

The Successful Loan subset of the "typical_incomes" portion of the loans_to_2008 dataset appears to be the best correlations I can get with these investigations. At least of the manipulations that I have been able to think of.

The biggenst news with the pared-down data set broken into Successful and Unsuccessful subsets, is that we finally got a ROI/SMI that was not considerably below |.1|:

Succesful subset -.14123 pearson and -.124804 spearman (Unsuccessful subset still quite a bit below .1).

The Successful ROI/CSRU was quite significant: -.562447 pearson and -.570404 spearman.

And the SMI/CSRU score was signficant for both the Succesful (.252842 pearson and .25707 spearman) and Unsuccessful (.252145 pearson and .34411 spearman) subsets.

With the largest outliers removed, and the data broken into Successful Loans and Unsuccessful Loans subsets, Stated Monthly Income appears to be a fair predictor of ROI (Return on Investment) for the Successful loans, with a pearson score of -.14 and a spearman of -.12. Within that same Successful subset, CSRU appears to be a very good predictor of ROI (pearson -.562447 and spearman -.570404). Why the negative correlation? Why should ROI decrase when either SMI or CSRU increase? Remember that we are within the Successful subset, so almost all these loans have positive ROIs. People with lower incomes tend to have lower credit ratings, so they are charged a higher rate; therefore, if they pay off their debt, the loan company makes more money than if someone with a higher credit rating pays off their debt.

Of course, in practice, preclassifying the loans into Successful and Unsuccessful ones is impossible. The whole point of the predictor variables was to help us predict ahead of time how well the loans would do.

The moderate correlation between Stated Monthly Income and Credit Score Range Upper (about .26 for both pearsons and the Succseful Loans subset's spearman; and .34 spearman in the Unsuccessful Loans subset) helps to explain how it is possible for CSRU to track so much better to ROI than SMI. (If SMI and CSRU were highly correlated, it would be very surprising if CSRU was a much better predictor of ROI than SMI.)

Why did all tests of the Successful Loans subset yield scores well over |.1|, but only three of the Unsuccessful Loans subsets did: ROI/CSRU spearman (.1), and SMI / CSRU pearson (.25) and spearman (.34)? Is this related to the fact that the Successful Loans subset's ROI values are almost all between 0 and .75, but the Unsuccessful Loan subset's ROI values range from -1 through to .75? What is going on there?

Pearsons and Spearmans Tests on the typical_incomes dataframe:

Below I ran Pearsons and Spearmans Correlation tests on the typical_incomes dataframe, and then on the Successful and Unsuccessful subsets of the typical_incomes dataframe.

I don't know why the correspondence coefficients were a little different from the table above. The below tests rounded to only three digits (instead of the six digits above), but the difference seems a little bigger than that (when I round the six digit numbers up to three digit numbers, I don't get the exact numbers as the below results).

In any case, every p-value came up 0, except for ROI/SMI spearman (coefficient was -.003; p-value was .584).

What does a zero p-value mean in this case? that it is statistically impossible for these points to appear in this order without some underlying mathematical relationship existing between them? IE: It is impossible that these two variables could relate to each other in this way without there being a mathematical correspondence between them? Really? How now? I thought anything was possible. But maybe a vanishingly small p-value gets rounded up to zero, and that's the story here.

Correspondence tests on full dataframe

Correspondence Tests on Successful Loans subset

Correspondence Tests on Unsuccessful Loans subset